EDA
Individual EDA of Work Variations
Factor w/ 4 levels "[0,5.3]","(5.3,7.9]",..: 2 4 2 3 1 3 3 3 3 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 5.300 7.900 9.251 11.600 100.000 101


Factor w/ 4 levels "[0,23.7]","(23.7,31.7]",..: 3 1 2 2 4 2 1 4 2 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 23.70 31.70 33.23 41.80 100.00 105


Factor w/ 4 levels "[0,20.3]","(20.3,23.9]",..: 2 2 2 3 1 4 3 4 3 1 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 20.30 23.90 24.12 27.70 100.00 105


Factor w/ 4 levels "[0,14.1]","(14.1,18.3]",..: 2 4 4 3 2 2 4 1 2 1 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 14.10 18.30 19.65 24.00 100.00 105


Factor w/ 4 levels "[0,5.4]","(5.4,8.7]",..: 3 3 3 2 1 2 3 2 2 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 5.400 8.700 9.636 12.800 100.000 105


Factor w/ 4 levels "[0,7.7]","(7.7,12.3]",..: 3 4 3 3 3 3 3 1 3 4 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 7.70 12.30 13.36 17.80 100.00 105


Factor w/ 4 levels "[0,3.5]","(3.5,5.4]",..: 2 3 4 1 2 3 1 3 2 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 3.500 5.400 6.109 7.900 100.000 105


Next, the seven work-variation variables (professional, production, unemployment, office, service, construction, and self-employed) were assessed for normality. The boxplots that exhibited a decrease in income as the proportion of a given work variation in the census tract increased were unemployment, service, construction, and production. That is to say, as more unemployed individuals were counted in a given census tract, income per capita decreased. The only work variation that exhibited an increase in average income was professional work. The remaining variables, office and self-employed, remained relatively stable across quartiles. The histograms suggested that only the proportion of professionals was normally distributed; the remaining six work variations were all skewed to the right. For professionals, the Q-Q plot affirmed normality, as the sample quantiles did not stray far from the reference line and the right and left tails were very small. The same cannot be said for the other variables, each of which had an oversized right tail and a relatively small left tail. Overall, the proportion of professionals appeared normally distributed while the other work variations did not.
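The skew read off the histograms can be quantified directly. Below is a minimal sketch (not the report's R code) of a sample-skewness check on two synthetic stand-ins: a roughly symmetric sample like the professional share, and a right-skewed one like the unemployment share. The distributions and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# hypothetical stand-ins for tract-level percentages
professional_like = rng.normal(loc=30, scale=10, size=10_000)  # near-symmetric
unemployment_like = rng.exponential(scale=8, size=10_000)      # right-skewed

print(round(skewness(professional_like), 2))  # near zero
print(round(skewness(unemployment_like), 2))  # clearly positive
```

A skewness near zero is consistent with the symmetric, normal-looking histogram seen for professionals; a large positive value matches the right-skewed shapes of the other six variables.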
Individual EDA of Ethnicities
Factor w/ 4 levels "[0,0.8]","(0.8,4]",..: 3 4 4 2 4 3 4 3 3 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.80 4.00 13.78 15.32 100.00


Factor w/ 4 levels "[0,2.4]","(2.4,7.2]",..: 1 1 1 3 1 3 2 1 1 1 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 2.40 7.20 17.36 21.50 100.00


Factor w/ 4 levels "[0,0.1]","(0.1,1.2]",..: 2 3 3 1 3 1 1 1 1 2 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.100 1.200 4.347 4.400 91.300


Factor w/ 4 levels "[0,37.1]","(37.1,70.3]",..: 3 2 3 3 2 3 3 3 4 3 ...

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 37.10 70.30 61.24 88.40 100.00


Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.7567 0.4000 100.0000


Finally, the five ethnicity variables (Native, White, Black, Hispanic, and Asian) were investigated. The boxplots for White showed an increase in average income across the first, second, and third quartiles but no change in the fourth. The boxplot for Asian showed an increase from the first through the fourth quartile. The boxplots for Hispanic increased slightly between the first and second quartiles, did not change at the third, and decreased significantly at the fourth. The boxplot for Black increased in average income between the first and second quartiles, then decreased from the second through the fourth. Overall, it appeared that average income did change with the concentration of ethnicities in a census tract. The histogram for White was bimodal, with the highest frequency at over 8,000. The histograms for the other four ethnicities were skewed to the right. Based on the histograms, White had the most responses, followed by Hispanic, Black, Asian, and Native. For each ethnicity variable, the points on the Q-Q plot followed a curve with large left and right tails rather than the reference line. There were also not enough Native responses to construct a meaningful boxplot, and the clear pattern in the Native Q-Q plot implied non-normality. Therefore, based on the boxplots, histograms, and Q-Q plots, none of the ethnicity variables appeared normally distributed.
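The Q-Q judgment used here can also be made numerically: the correlation between the sorted sample and the theoretical normal quantiles is very close to 1 for normal data and visibly lower when the right tail is oversized. A hedged sketch on synthetic samples (not the census variables):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)

def qq_corr(x):
    """Correlation between sample and theoretical normal quantiles
    (the probability-plot correlation behind a Q-Q plot)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(x, theo)[0, 1]

near_normal = rng.normal(size=2_000)        # points hug the Q-Q line
right_skewed = rng.exponential(size=2_000)  # curved, heavy right tail

print(round(qq_corr(near_normal), 3))
print(round(qq_corr(right_skewed), 3))
```

The right-skewed sample's lower correlation mirrors the curved Q-Q plots observed for all five ethnicity variables.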
'data.frame': 69672 obs. of 11 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ IncomePerCap: int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
[1] 626
[1] 0
[1] 11
'data.frame': 69567 obs. of 11 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ IncomePerCap: int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
- attr(*, "na.action")= 'omit' Named int 1484 1807 2299 2499 2789 4259 4444 4448 4449 4477 ...
..- attr(*, "names")= chr "1514" "1851" "2370" "2574" ...
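The drop from 69,672 observations to 69,567 complete cases above corresponds to R's `na.omit`. A minimal numpy sketch of the same idea, on a toy matrix rather than the census frame:

```python
import numpy as np

# toy matrix standing in for the census frame; np.nan marks missing values
data = np.array([
    [0.9, 25713.0],
    [np.nan, 18021.0],
    [10.5, np.nan],
    [0.7, 27526.0],
])

# keep only rows with no missing values (the analogue of R's na.omit)
complete = data[~np.isnan(data).any(axis=1)]
print(complete.shape[0])  # 2 rows survive
```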
PCA

A Principal Component Analysis (PCA) and Principal Component Regression (PCR) seemed suited to this dataset. The purpose of these techniques is to decrease the number of variables while accounting for collinearity. Within this dataset there were 10 variables available to explain IncomePerCap, and the correlation matrix showed notable correlation among some of the predictors. For example, Professional had notable correlations with Service, Construction, Production, and Unemployment, and White had notable correlations with Hispanic and Black. From this initial overview of the correlation matrix, PCA seemed suitable and was pursued.
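The motivating check can be sketched as follows: build the correlation matrix for collinear predictors, then obtain the PCA from its eigendecomposition. The data here are synthetic, and the variable names only loosely mimic the report's.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# synthetic predictors with built-in collinearity, loosely mimicking the
# negative Professional/Unemployment relationship in the correlation matrix
professional = rng.normal(size=n)
unemployment = -0.6 * professional + rng.normal(scale=0.8, size=n)
office = rng.normal(size=n)
X = np.column_stack([professional, unemployment, office])

# correlation matrix of the predictors
R = np.corrcoef(X, rowvar=False)

# PCA on the correlation matrix: each eigenvalue is one component's
# variance; sorting descending orders the components PC1, PC2, ...
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
prop_var = eigvals / eigvals.sum()
print(np.round(prop_var, 2))
```

With correlated predictors, the first component absorbs more than its "fair share" (1/p) of the variance, which is exactly what makes PCA worthwhile here.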

The biplot above plotted over 70,000 data points, resulting in a dense scattering of data. The horizontal axis of this plot was PC1 and the vertical axis PC2; PC1 spanned approximately -5 to 10, while PC2 ranged from about -8 to 10. The loadings for White, Production, Unemployment, Black, and Professional were split fairly evenly between PC1 and PC2, while other variables, such as Office, Service, and Construction, loaded mostly on one component or the other.
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.7878 1.3389 1.1653 1.0355 0.88819 0.82267 0.76933
Proportion of Variance 0.3196 0.1792 0.1358 0.1072 0.07889 0.06768 0.05919
Cumulative Proportion 0.3196 0.4989 0.6347 0.7419 0.82078 0.88845 0.94764
PC8 PC9 PC10
Standard deviation 0.71304 0.12303 0.003342
Proportion of Variance 0.05084 0.00151 0.000000
Cumulative Proportion 0.99849 1.00000 1.000000
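The proportion and cumulative-proportion rows follow directly from the component standard deviations; recomputing them from the table above:

```python
import numpy as np

# component standard deviations, copied from the summary table above
sdev = np.array([1.7878, 1.3389, 1.1653, 1.0355, 0.88819,
                 0.82267, 0.76933, 0.71304, 0.12303, 0.003342])

variance = sdev ** 2                 # each component's variance
prop = variance / variance.sum()     # proportion of total variance
cum = np.cumsum(prop)                # cumulative proportion

print(np.round(cum[:3], 4))          # share explained by PC1-PC3 together
```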
The breakdown of the variation explained by each component showed that just over 60% of the variation was accounted for by the first three components. However, beyond the first component, the decrease in variation explained from one component to the next was gradual and fairly uniform. This was further illustrated by the following graph.

Call:
lm(formula = IncomePerCap ~ ., data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-57889 -3154 -136 3093 39355
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 20.93 1250.463 < 2e-16 ***
PC1 -4585.05 11.71 -391.701 < 2e-16 ***
PC2 -1454.29 15.63 -93.043 < 2e-16 ***
PC3 604.54 17.96 33.664 < 2e-16 ***
PC4 994.55 20.21 49.214 < 2e-16 ***
PC5 -878.20 23.56 -37.274 < 2e-16 ***
PC6 1377.18 25.44 54.140 < 2e-16 ***
PC7 -205.74 27.20 -7.564 3.96e-14 ***
PC8 -196.99 29.35 -6.712 1.93e-11 ***
PC9 3301.06 170.10 19.407 < 2e-16 ***
PC10 -3519.28 6262.21 -0.562 0.574
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.704e+04 on 10 and 69556 DF, p-value: < 2.2e-16
A full principal component regression was performed, and every component except the last was deemed significant. Notably, this regression explained 71.01% of the variance in IncomePerCap according to the adjusted R-squared. The strongest predictor was, unsurprisingly, PC1, whose t-value was far larger in magnitude than those of the other components.
A plot of the R-squared values against the number of components showed how much of the variation in the dependent variable, IncomePerCap, was explained as components were added. The steep initial increase followed by a leveling off indicated that a significant share of the variation in IncomePerCap was explained by the first component alone. Based on this R-squared graph and the results of the full regression, it seemed appropriate to run a regression on just PC1, which resulted in a lower adjusted R-squared.
Call:
lm(formula = IncomePerCap ~ PC1, data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-42965 -4026 -442 3546 36606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 23.34 1121.0 <2e-16 ***
PC1 -4585.05 13.06 -351.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6157 on 69565 degrees of freedom
Multiple R-squared: 0.6393, Adjusted R-squared: 0.6393
F-statistic: 1.233e+05 on 1 and 69565 DF, p-value: < 2.2e-16
The results of this regression on PC1 alone confirmed that the R-squared was smaller, at 63.93%. The tradeoff between parsimony and explanatory power made the choice between the two models unclear. If the more explanatory model were chosen on adjusted R-squared, which penalizes the number of components included, only one component would be removed. That is not an effective reduction in the number of variables, though it would introduce little bias since only one component was dropped.
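The PCR mechanics themselves, rotating the centered predictors onto their principal axes and regressing on the scores, can be sketched on synthetic data (none of the numbers below are the report's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# synthetic collinear predictors and a response driven mainly by one of them
X = rng.normal(size=(n, 3))
X[:, 1] += 0.8 * X[:, 0]
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# principal component scores via SVD of the centered predictors
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

def r_squared(design, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y)), design])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var()

r2_all = r_squared(scores, y)         # all components: same fit as full OLS
r2_pc1 = r_squared(scores[:, :1], y)  # PC1 alone: lower R^2, as in the report
print(round(r2_all, 3), round(r2_pc1, 3))
```

Using all components reproduces the full regression's fit exactly (the rotation loses nothing), while keeping only PC1 trades explanatory power for parsimony, the same tradeoff weighed above.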
K-Means
List of 9
$ cluster : Named int [1:69567] 2 2 2 2 2 1 2 1 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:2, 1:11] -0.329 0.174 0.42 -0.222 -0.32 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "1" "2"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:2] 1.15e+12 1.34e+12
$ tot.withinss: num 2.5e+12
$ betweenss : num 4.81e+12
$ size : int [1:2] 24086 45481
$ iter : int 1
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 2 clusters of sizes 24086, 45481
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3285506 0.4200874 -0.3199683 0.2372900 0.9282858 -0.6150829
2 0.1739950 -0.2224715 0.1694500 -0.1256649 -0.4916051 0.3257379
Office Construction Production Unemployment IncomePerCap
1 -0.01890396 -0.4302842 -0.6556146 -0.5268802 37598.33
2 0.01001123 0.2278715 0.3472029 0.2790272 20114.40
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 1.153649e+12 1.344211e+12
(between_SS / total_SS = 65.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 2 1 1 2 2 2 1 2 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:3, 1:11] 0.432 -0.24 -0.363 -0.575 0.34 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:3] 4.33e+11 3.94e+11 3.93e+11
$ tot.withinss: num 1.22e+12
$ betweenss : num 6.09e+12
$ size : int [1:3] 27175 29638 12754
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 3 clusters of sizes 27175, 29638, 12754
Cluster means:
Hispanic White Black Asian Professional Service
1 0.4318377 -0.5751979 0.3952287 -0.15378102 -0.7689991 0.6127621
2 -0.2397814 0.3399978 -0.2076592 -0.02410908 0.1329131 -0.2111302
3 -0.3629095 0.4354829 -0.3595529 0.38368702 1.3296436 -0.8149861
Office Construction Production Unemployment IncomePerCap
1 -0.02605785 0.2974405 0.51204618 0.6194822 16576.79
2 0.07327428 0.0108065 -0.07894429 -0.3101802 27821.96
3 -0.11475466 -0.6588701 -0.90756657 -0.5991303 42759.54
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 2 1 3 3 2 2 2 1 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
1 1 2 2 3 2 3 1 3 3 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 433025278355 393691127581 392888818743
(between_SS / total_SS = 83.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 4 2 4 4 4 1 4 1 4 4 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:4, 1:11] -0.302 0.668 -0.374 -0.132 0.407 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "1" "2" "3" "4"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:4] 1.76e+11 1.92e+11 1.73e+11 1.75e+11
$ tot.withinss: num 7.15e+11
$ betweenss : num 6.6e+12
$ size : int [1:4] 17574 17698 8266 26029
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 4 clusters of sizes 17574, 17698, 8266, 26029
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3016974 0.4066104 -0.28173021 0.1074823 0.5711741 -0.43512866
2 0.6680011 -0.8809881 0.57590840 -0.1651807 -0.9293827 0.83254425
3 -0.3738848 0.4389987 -0.37976189 0.4559152 1.5331982 -0.92442949
4 -0.1317654 0.1850702 -0.08076331 -0.1050413 -0.2406168 0.02128076
Office Construction Production Unemployment IncomePerCap
1 0.06629532 -0.2290380 -0.4319137 -0.46098899 32840.09
2 -0.05484069 0.3137655 0.5749681 0.89883620 14433.73
3 -0.17952039 -0.7693828 -1.0184110 -0.62970642 45780.77
4 0.04953752 0.1856318 0.2240905 -0.09992813 23412.84
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 2 4 4 4 1 4 1 4 4 4 4 4 2 4 1 4 2 1 1 4 4 1 2 4 4
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 4 1 1 3 1 1 4 1 3 4 1 1 4 4 4 4 4 2 2 2 2 2 4 4 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 4 4 2 4 4 4 2 2 2 4 4 4 2 2 4 4 4 2 4 2 4 4
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 175536566241 191571930523 172553341760 175374034150
(between_SS / total_SS = 90.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 5 1 1 1 5 5 1 4 1 5 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:5, 1:11] -0.00846 0.85002 -0.3769 -0.33227 -0.24884 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:5] 9.11e+10 1.06e+11 9.67e+10 9.02e+10 8.75e+10
$ tot.withinss: num 4.72e+11
$ betweenss : num 6.84e+12
$ size : int [1:5] 20760 12813 6192 11579 18223
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 5 clusters of sizes 20760, 12813, 6192, 11579, 18223
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.008456987 -0.01428733 0.06737527 -0.12483068 -0.4552929 0.2049365
2 0.850017949 -1.08804866 0.68083922 -0.17734851 -1.0217088 0.9792310
3 -0.376900236 0.44391016 -0.39300833 0.48161313 1.6342530 -0.9824220
4 -0.332267736 0.42488521 -0.31698657 0.22115974 0.8637778 -0.5739916
5 -0.248840396 0.36049690 -0.22051300 -0.03726641 0.1329121 -0.2234518
Office Construction Production Unemployment IncomePerCap
1 0.02032925 0.25592838 0.38069044 0.09861782 20788.54
2 -0.07442558 0.32126962 0.59350367 1.09913874 13091.47
3 -0.21879800 -0.82200303 -1.06575585 -0.64529647 47520.79
4 0.02701923 -0.40030391 -0.64316728 -0.52575884 36236.88
5 0.08634809 0.01621363 -0.08018998 -0.33184070 27836.80
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
5 1 1 1 5 5 1 4 1 5 1 1 1 1 5 5 1 2 4 4 5 5 5 1 1 1
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
1 1 5 5 4 4 4 1 4 3 1 5 5 1 1 1 5 1 2 1 2 1 2 1 1 1
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
1 1 1 2 1 1 1 1 1 1 1 5 1 2 2 1 1 5 1 1 2 1 1
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 91107426890 106355408998 96685021931 90198216197 87531045107
(between_SS / total_SS = 93.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 6 4 4 6 6 2 4 2 6 6 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:6, 1:11] -0.349 -0.285 0.974 0.132 -0.385 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:6] 5.33e+10 5.39e+10 6.79e+10 5.20e+10 5.61e+10 ...
$ tot.withinss: num 3.36e+11
$ betweenss : num 6.98e+12
$ size : int [1:6] 8294 13096 9901 16290 4763 17223
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 6 clusters of sizes 8294, 13096, 9901, 16290, 4763, 17223
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3494717 0.4296992 -0.3360808 0.30654695 1.0897348 -0.68272359
2 -0.2854601 0.3976771 -0.2665621 0.05425283 0.4263884 -0.36762711
3 0.9739417 -1.2113147 0.7268560 -0.18905224 -1.0768116 1.07158297
4 0.1316581 -0.2288132 0.2195608 -0.13438783 -0.6075451 0.36540166
5 -0.3852568 0.4492850 -0.4005145 0.50197893 1.7117093 -1.02643109
6 -0.1925231 0.2792048 -0.1502203 -0.09190832 -0.1287054 -0.06945889
Office Construction Production Unemployment IncomePerCap
1 -0.029446162 -0.5329557 -0.7839551 -0.5661097 38948.40
2 0.089116820 -0.1457015 -0.3276215 -0.4288697 31179.25
3 -0.088194644 0.3291525 0.5983303 1.2390235 12157.33
4 0.007625546 0.2812103 0.4727567 0.2799036 18933.00
5 -0.254726418 -0.8568239 -1.1023498 -0.6538796 48913.98
6 0.060350087 0.1491981 0.1403863 -0.1974674 24809.23
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
6 4 4 6 6 2 4 2 6 6 6 4 4 4 6 2 6 3 1 1 6 6 2 4 6 6
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
4 4 2 2 1 2 1 4 1 5 6 2 2 6 4 4 6 4 3 4 3 4 3 4 4 4
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
4 4 6 4 4 4 4 4 4 4 4 6 4 4 4 4 4 6 4 4 3 4 6
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 53289943534 53860735274 67861554081 51981016101 56119303347 52399097534
(between_SS / total_SS = 95.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 2 5 7 7 2 2 7 3 7 7 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:7, 1:11] -0.396 -0.258 -0.313 1.054 0.237 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:7] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:7] 3.01e+10 3.39e+10 3.39e+10 5.06e+10 3.68e+10 ...
$ tot.withinss: num 2.52e+11
$ betweenss : num 7.06e+12
$ size : int [1:7] 3586 13111 9445 8258 13680 6091 15396
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 7 clusters of sizes 3586, 13111, 9445, 8258, 13680, 6091, 15396
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3964506 0.4542494 -0.40764714 0.5282089 1.7925081 -1.0717282
2 -0.2576300 0.3733856 -0.23385201 -0.0269068 0.1755536 -0.2474022
3 -0.3125257 0.4200419 -0.30528640 0.1527982 0.6907586 -0.4893448
4 1.0539699 -1.2761352 0.73515083 -0.1940444 -1.1025060 1.1203953
5 0.2366030 -0.3958078 0.34172930 -0.1373108 -0.7010158 0.4859066
6 -0.3575666 0.4294729 -0.34914245 0.3685170 1.2778561 -0.7833195
Office Construction Production Unemployment IncomePerCap
1 -0.29189615 -0.895033912 -1.1399021 -0.65954880 50244.25
2 0.08656937 -0.003686199 -0.1156669 -0.35053069 28266.51
3 0.06459196 -0.301790762 -0.5299635 -0.49757889 34274.90
4 -0.08602009 0.329565942 0.5903558 1.33577584 11570.50
5 -0.02053170 0.292153979 0.5252348 0.41200162 17787.86
6 -0.07512151 -0.640358239 -0.8939002 -0.59387071 41486.24
[ reached getOption("max.print") -- omitted 1 row ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 5 7 7 2 2 7 3 7 7 7 7 5 5 7 2 7 5 3 3 2 2 2 5 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
5 5 2 2 6 3 3 7 3 6 7 2 3 7 5 5 2 5 4 5 5 5 4 5 7 5
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
5 5 7 5 5 7 7 5 5 5 5 2 5 5 5 7 5 2 5 7 4 5 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 30078367255 33918666154 33924690315 50620627295 36763050021 31979820579
[7] 34290620831
(between_SS / total_SS = 96.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 1 8 7 1 1 5 7 5 1 1 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:8, 1:11] -0.21 -0.398 -0.331 -0.359 -0.276 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:8] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:8] 2.38e+10 2.24e+10 2.47e+10 2.28e+10 2.44e+10 ...
$ tot.withinss: num 1.95e+11
$ betweenss : num 7.12e+12
$ size : int [1:8] 13055 3152 7742 5140 10623 6131 13192 10532
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 8 clusters of sizes 13055, 3152, 7742, 5140, 10623, 6131, 13192, 10532
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.2099021 0.31065143 -0.17687237 -0.08368332 -0.08389187 -0.09881344
2 -0.3981141 0.45578358 -0.40960492 0.53311507 1.81785638 -1.08784127
3 -0.3314624 0.43062774 -0.31765082 0.20067007 0.83132143 -0.55740904
4 -0.3588452 0.42873072 -0.36110010 0.40713285 1.35698554 -0.82340123
5 -0.2764009 0.38868612 -0.25353852 0.02632578 0.34875186 -0.33062038
6 1.1604954 -1.34363273 0.72389507 -0.20529827 -1.11835632 1.17173776
Office Construction Production Unemployment IncomePerCap
1 0.06138421 0.1289094 0.1064936 -0.23047133 25340.47
2 -0.30249732 -0.9092239 -1.1486967 -0.66205029 50788.01
3 0.03930789 -0.3810941 -0.6273035 -0.52118182 35892.73
4 -0.10289042 -0.6828886 -0.9379571 -0.61013558 42677.39
5 0.09089521 -0.1005031 -0.2648395 -0.40948506 30223.86
6 -0.08558404 0.3344531 0.5598170 1.46755772 10698.00
[ reached getOption("max.print") -- omitted 2 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 8 7 1 1 5 7 5 1 1 1 7 7 8 1 5 7 8 3 3 1 1 5 7 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
8 7 5 5 4 5 3 7 3 4 1 5 5 1 7 7 1 7 6 8 8 8 8 7 7 7
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
8 7 7 8 7 7 7 8 7 8 7 1 7 8 8 7 7 1 7 7 6 7 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 23806194051 22351110068 24674184543 22763698227 24425126210 32229516603
[7] 22734152115 22364091140
(between_SS / total_SS = 97.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 7 3 3 7 9 1 3 1 7 7 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:9, 1:11] -0.294 -0.347 0.049 -0.402 1.22 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:9] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:9] 1.54e+10 1.52e+10 1.73e+10 1.58e+10 2.42e+10 ...
$ tot.withinss: num 1.55e+11
$ betweenss : num 7.16e+12
$ size : int [1:9] 8025 5905 11753 2719 4994 4354 12230 9140 10447
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 9 clusters of sizes 8025, 5905, 11753, 2719, 4994, 4354, 12230, 9140, 10447
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.29398811 0.4084707 -0.2881384 0.09801206 0.5412472 -0.41887898
2 -0.34732990 0.4326622 -0.3266366 0.26353963 0.9956851 -0.63591035
3 0.04903299 -0.1030093 0.1312846 -0.13421926 -0.5427295 0.27794329
4 -0.40241766 0.4583190 -0.4144673 0.54771279 1.8491328 -1.10706894
5 1.21957721 -1.3726251 0.7042742 -0.21976591 -1.1201758 1.19133726
6 -0.35966388 0.4293714 -0.3692219 0.42709539 1.4300252 -0.86183629
Office Construction Production Unemployment IncomePerCap
1 0.092171984 -0.214882278 -0.42684953 -0.4601316 32552.93
2 -0.003486057 -0.478374937 -0.72844821 -0.5525047 37694.40
3 0.017290823 0.273752980 0.44804925 0.1820710 19710.62
4 -0.310870838 -0.926207144 -1.16439186 -0.6642217 51363.70
5 -0.083922955 0.335218762 0.54026333 1.5507180 10154.85
6 -0.136970585 -0.719826670 -0.97229327 -0.6191718 43867.61
[ reached getOption("max.print") -- omitted 3 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
7 3 3 7 9 1 3 1 7 7 7 3 3 3 7 9 7 8 2 2 9 9 9 3 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
3 3 1 9 2 1 2 3 2 6 7 1 1 7 3 3 9 3 5 8 8 8 8 3 3 3
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
3 3 7 8 3 3 3 8 3 3 3 9 3 8 8 3 3 7 3 3 5 3 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 15416927509 15194151787 17298838728 15761413559 24236871148 16554689443
[7] 17160968891 17348442911 16348725566
(between_SS / total_SS = 97.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 9 2 8 9 5 5 8 10 9 9 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:10, 1:11] 1.293 0.172 -0.361 0.733 -0.269 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:10] 1.67e+10 1.25e+10 1.39e+10 1.17e+10 1.19e+10 ...
$ tot.withinss: num 1.28e+11
$ betweenss : num 7.18e+12
$ size : int [1:10] 3740 9868 3958 7455 8799 2488 5279 10782 10229 6969
$ iter : int 2
$ ifault : int 4
- attr(*, "class")= chr "kmeans"
K-means clustering with 10 clusters of sizes 3740, 9868, 3958, 7455, 8799, 2488, 5279, 10782, 10229, 6969
Cluster means:
Hispanic White Black Asian Professional Service
1 1.29347073 -1.3926502 0.65110535 -0.227144565 -1.10439711 1.2206491
2 0.17193979 -0.2995527 0.27181346 -0.131665132 -0.65475827 0.4157717
3 -0.36067127 0.4329611 -0.37594612 0.433931771 1.47068572 -0.8852975
4 0.73251794 -1.0447516 0.74170741 -0.163014767 -1.02683185 0.9384102
5 -0.26944247 0.3834500 -0.24218980 -0.002215569 0.27358175 -0.2989392
6 -0.40143948 0.4574908 -0.41624907 0.552865396 1.86667764 -1.1156001
Office Construction Production Unemployment IncomePerCap
1 -0.076264666 0.32954885 0.47907560 1.65962676 9446.23
2 -0.005854058 0.28803780 0.50879858 0.34460804 18266.13
3 -0.150092261 -0.74228280 -0.99238368 -0.63110451 44529.44
4 -0.088022816 0.32539421 0.65367907 0.92950639 14162.91
5 0.094618993 -0.05725246 -0.20068842 -0.38834330 29357.81
6 -0.323599903 -0.93488726 -1.16996086 -0.66198939 51688.15
[ reached getOption("max.print") -- omitted 4 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
9 2 8 9 5 5 8 10 9 9 8 8 2 2 9 5 8 4 10 7 9 9 5 2 8 8
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 2 5 5 7 10 10 8 10 3 8 5 10 8 2 2 9 2 4 2 4 2 4 2 8 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 2 8 4 2 8 8 2 2 2 8 9 2 4 2 8 2 9 2 8 4 2 8
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 16668900913 12479477177 13910959117 11660948482 11895068060 12674139695
[7] 12952720882 11959674961 11678714620 12133241415
(between_SS / total_SS = 98.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

K-means is an unsupervised learning algorithm whose goal is to find groups, or clusters, in the data in order to identify patterns. All of the variables were standardized (z-scored) so that they could be compared on a similar scale. K-means was run for k = 2 through 10 clusters. For k = 2, the cluster with the higher IncomePerCap (37598) also had the highest cluster means for Professional at 0.928, White at 0.420, and Asian at 0.237. The cluster plot shows all ~70,000 data points, with the two clusters in blue and red; the clusters appear to overlap, but only because the points are projected onto a two-dimensional plane. With two clusters, about 65.8% of the total sum of squares is between clusters. Further inspection was conducted for a model with k = 3. The cluster with the highest IncomePerCap was cluster three at 42760; this cluster also had the highest cluster means for Professional at 1.330 and Asian at 0.384. The first cluster, with a mean IncomePerCap of 16577, had the highest mean Unemployment at 0.619. The cluster plot shows three distinct clusters, although the overlap makes them somewhat hard to tell apart. With three clusters, 83.3% of the variability is captured, a drastic improvement over two clusters. A final analysis was conducted for a model with k = 4. The cluster with the highest IncomePerCap was cluster three at 45781; it had the highest mean Professional at 1.533 and the highest mean Asian at 0.456. The cluster with the lowest IncomePerCap was cluster two at 14434, which also had the highest mean Unemployment at 0.899.
The cluster plot for k = 4 is harder to interpret, since all of the data points are projected onto two dimensions and there are now four clusters. With four clusters, 90.2% of the variability is captured, again a drastic improvement over two clusters. As k increased from 5 to 10, the percentage captured did not increase much further; for example, at k = 10, 98.2% is captured. A model with four clusters would therefore be sufficient, as it already captures most of the variability.
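The sweep over k described above can be sketched in base R. The data below are a synthetic stand-in for the scaled census variables (the real analysis clustered the standardized demographic and work columns); the seed, sizes, and column count are illustrative only.

```r
# Sketch of the k-means sweep described above, on synthetic stand-in data
# (the real analysis clustered the scaled census variables).
set.seed(42)
dat <- rbind(matrix(rnorm(200, mean = 0), ncol = 4),
             matrix(rnorm(200, mean = 5), ncol = 4))
dat_scaled <- scale(dat)  # z-score each column before clustering

# Fit k-means for k = 2..10 and record between_SS / total_SS,
# the "percentage captured" quoted in the text.
ratios <- sapply(2:10, function(k) {
  fit <- kmeans(dat_scaled, centers = k, nstart = 20)
  fit$betweenss / fit$totss
})
names(ratios) <- 2:10
round(ratios, 3)
```

The ratio is non-decreasing in k, which is why the marginal gain shrinks past k = 4 in the report.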
KNN
Preprocessing KNN


Factor w/ 2 levels "[855,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
[1] "factor"
[1] "[855,2.47e+04]" "(2.47e+04,5.6e+04]"
Factor w/ 4 levels "[855,1.88e+04]",..: 3 1 2 2 3 3 2 4 2 2 ...
[1] "factor"
[1] "[855,1.88e+04]" "(1.88e+04,2.47e+04]" "(2.47e+04,3.23e+04]"
[4] "(3.23e+04,5.6e+04]"
'data.frame': 69567 obs. of 12 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ ipc2 : Factor w/ 2 levels "[855,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
$ ipc4 : Factor w/ 4 levels "[855,1.88e+04]",..: 3 1 2 2 3 3 2 4 2 2 ...
[1] 0
[1] 0
[1] 12
'data.frame': 69567 obs. of 12 variables:
$ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
$ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
$ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
$ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
$ Professional: num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
$ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
$ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
$ Construction: num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
$ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
$ Unemployment: num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
$ ipc2 : Factor w/ 2 levels "[855,2.47e+04]",..: 2 1 1 1 2 2 1 2 1 1 ...
$ ipc4 : Factor w/ 4 levels "[855,1.88e+04]",..: 3 1 2 2 3 3 2 4 2 2 ...
KNN Model
Train-Test split 3:1
KNN 2 categories
Selecting the correct “k”
How does “k” affect classification accuracy? Let’s create a function that calculates classification accuracy for each value of “k.”
num [1:2, 1:15] 1 0.796 3 0.823 5 ...
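Such a function can be sketched with the recommended `class` package's knn(); the two-class data here are synthetic stand-ins, and the object names and split are illustrative, not taken from the original script.

```r
library(class)  # knn() from the recommended "class" package

# Synthetic two-class problem standing in for the binned-income data.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))
y <- factor(rep(c("Low", "High"), each = 100))
idx <- sample(seq_len(nrow(x)), size = 150)  # 3:1 train-test split

# Classification accuracy on the held-out quarter for a given k
accuracy_for_k <- function(k) {
  pred <- knn(train = x[idx, ], test = x[-idx, ], cl = y[idx], k = k)
  mean(pred == y[-idx])
}

ks <- seq(1, 29, by = 2)                 # odd k avoids voting ties
acc <- sapply(ks, accuracy_for_k)
rbind(k = ks, accuracy = round(acc, 3))  # a 2 x 15 matrix, as printed above
</imports>
```

Sweeping odd k from 1 to 29 yields 15 accuracy values, matching the shape of the 2 × 15 matrix shown above.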

Results
Factor w/ 2 levels "[855,2.47e+04]",..: 1 1 1 1 1 2 2 1 1 1 ...
[1] 22836
dat_pred_ipc2
[855,2.47e+04] High
11175 11661
dat_ipc2.testLabels
dat_pred_ipc2 [855,2.47e+04] High
[855,2.47e+04] 9491 1684
High 1941 9720
[1] 22836
[1] 9491 9720
[1] 0.8412594
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
8.412594e-01 6.825271e-01 8.364545e-01 8.459773e-01 5.006131e-01
AccuracyPValue McnemarPValue
0.000000e+00 2.119375e-05
Sensitivity Specificity Pos Pred Value
0.8302134 0.8523325 0.8493065
Neg Pred Value Precision Recall
0.8335477 0.8493065 0.8302134
F1 Prevalence Detection Rate
0.8396514 0.5006131 0.4156157
Detection Prevalence Balanced Accuracy
0.4893589 0.8412730
KNN 4 categories
Selecting the correct “k”
How does “k” affect classification accuracy? Let’s create a function that calculates classification accuracy for each value of “k.”
num [1:2, 1:15] 1 0.563 3 0.597 5 ...

Results
Factor w/ 4 levels "[855,1.88e+04]",..: 1 2 2 1 2 3 3 1 1 2 ...
[1] 22836
dat_pred_ipc4
[855,1.88e+04] Mid-Low (2.47e+04,3.23e+04] (3.23e+04,5.6e+04]
5286 5886 5760 5904
dat_ipc4.testLabels
dat_pred_ipc4 [855,1.88e+04] Mid-Low (2.47e+04,3.23e+04]
[855,1.88e+04] 4120 982 165
Mid-Low 1299 2996 1422
(2.47e+04,3.23e+04] 238 1510 2923
(3.23e+04,5.6e+04] 66 221 1223
dat_ipc4.testLabels
dat_pred_ipc4 (3.23e+04,5.6e+04]
[855,1.88e+04] 19
Mid-Low 169
(2.47e+04,3.23e+04] 1089
(3.23e+04,5.6e+04] 4394
[1] 22836
[1] 4120 2996 2923 4394
[1] 0.6320284
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
6.320284e-01 5.093864e-01 6.257352e-01 6.382879e-01 2.510510e-01
AccuracyPValue McnemarPValue
0.000000e+00 1.874250e-19
Sensitivity Specificity Pos Pred Value
Class: [855,1.88e+04] 0.7199021 0.9318647 0.7794173
Class: Mid-Low 0.5247854 0.8312606 0.5090044
Class: (2.47e+04,3.23e+04] 0.5098552 0.8341227 0.5074653
Class: (3.23e+04,5.6e+04] 0.7748193 0.9120303 0.7442412
Neg Pred Value Precision Recall F1
Class: [855,1.88e+04] 0.9086610 0.7794173 0.7199021 0.7484785
Class: Mid-Low 0.8399410 0.5090044 0.5247854 0.5167745
Class: (2.47e+04,3.23e+04] 0.8354416 0.5074653 0.5098552 0.5086574
Class: (3.23e+04,5.6e+04] 0.9245807 0.7442412 0.7748193 0.7592225
Prevalence Detection Rate Detection Prevalence
Class: [855,1.88e+04] 0.2506131 0.1804169 0.2314766
Class: Mid-Low 0.2500000 0.1311964 0.2577509
Class: (2.47e+04,3.23e+04] 0.2510510 0.1279996 0.2522333
Class: (3.23e+04,5.6e+04] 0.2483360 0.1924155 0.2585391
Balanced Accuracy
Class: [855,1.88e+04] 0.8258834
Class: Mid-Low 0.6780230
Class: (2.47e+04,3.23e+04] 0.6719889
Class: (3.23e+04,5.6e+04] 0.8434248

Regressions
[1] 15 100

The dataset has ten predictor variables for the single response variable, IncomePerCap. Ridge regression was introduced because it minimizes the residual sum of squares plus a shrinkage penalty equal to lambda times the sum of the squared coefficients. As lambda increases, the coefficients approach zero, and the coefficient-path plot shows each variable shrinking toward zero along the entire path. To build the ridge regression, a log-scale grid of lambda values was constructed from 10^10 down to 10^-2 in 100 steps.
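The grid and the shrinkage effect can be illustrated in base R with the closed-form ridge solution; this is only a sketch of the penalty on synthetic stand-in data, not the report's actual fit (which came from a ridge-regression package).

```r
# Log-scale lambda grid from 1e10 down to 1e-2 in 100 steps, as described.
grid <- 10^seq(10, -2, length = 100)

# Closed-form ridge coefficients on standardized predictors:
#   beta = (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)
  yc <- y - mean(y)
  solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc))
}

set.seed(7)
X <- matrix(rnorm(500), ncol = 5)          # synthetic stand-in predictors
y <- X %*% c(3, -2, 1, 0, 0) + rnorm(100)
path <- sapply(grid, function(l) ridge_coef(X, y, l))  # 5 x 100 coefficient path
max(abs(path[, 1]))    # at lambda = 1e10 the coefficients are essentially zero
max(abs(path[, 100]))  # at lambda = 0.01 they are close to the OLS fit
```

Scanning the columns of `path` from left to right reproduces the coefficient-path plot: every coefficient grows from near zero toward its least-squares value as lambda shrinks.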
Train and Test sets
To avoid introducing bias when developing the Ridge and Lasso regressions, train and test data sets were introduced: the observations were split at random, with 50% assigned to the training set.
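A minimal sketch of that split (the seed and object names are illustrative, not taken from the original script):

```r
# Random 50/50 train-test split, as described above.
set.seed(123)
df <- data.frame(x = rnorm(100), y = rnorm(100))  # stand-in for the census data
train_idx <- sample(seq_len(nrow(df)), size = nrow(df) / 2)
train <- df[train_idx, ]
test  <- df[-train_idx, ]
c(train = nrow(train), test = nrow(test))
```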

[1] 824.8974
lowest lamda from CV: 824.8974
MSE for best Ridge lamda: 12154666
All the coefficients :
(Intercept) Hispanic White Black Asian Professional
19143.7093 -300.1618 436.8876 -34.8267 267.4880 1313.3468
Service Office Construction Production Unemployment
-988.9918 -358.9968 -465.5783 -707.8713 -806.0590
R^2:
[1] 0.8843423
To select the best Ridge model, cross-validation was implemented. The cross-validation curve shows that the model retains all ten predictors and that, as lambda decreases, the mean squared error also decreases; the best lambda is indicated by the first vertical line. The lowest lambda from cross-validation was found to be about 825, and the MSE for the best Ridge model was 12154666. In this model, the most positive coefficients were Professional at 1313, White at 437, and Asian at 267, while the strongest negative coefficients were Service at -989, Unemployment at -806, and Production at -708. It is interesting to note that Professional was the only work variable with a positive coefficient; the others were all negative. The R^2 value for the best Ridge model was found to be 0.884, meaning that 88.4% of the variation in income can be explained by the model.
Lasso


lowest lamda from CV: 25.78395
MSE for best Lasso lamda: 11672982
All the coefficients :
(Intercept) Hispanic White Black Asian Professional
17484.872000 -212.822777 219.903152 0.000000 165.286025 1987.274255
Service Office Construction Production Unemployment
-284.732831 8.180852 0.000000 -4.656966 -619.976123
The non-zero coefficients :
(Intercept) Hispanic White Asian Professional Service
17484.872000 -212.822777 219.903152 165.286025 1987.274255 -284.732831
Office Production Unemployment
8.180852 -4.656966 -619.976123
[1] 0.8889258
Lasso regression was also implemented to see whether it would perform differently from the OLS or Ridge models. Lasso can reduce over-fitting and assist in model selection, since it can drive coefficients exactly to zero. In the fitted model, the three most positive coefficients were Professional at 1987, White at 220, and Asian at 165, meaning these variables pull income upward more strongly than the others. The most negative coefficients were Unemployment at -620, Service at -285, and Production at -4.7. Black and Construction both had coefficients of exactly 0.0, so they drop out of the final Lasso model. It is also interesting to note that the Office coefficient is very small at 8.2, so it contributes little to the fit.
Cross-validation was introduced to select the lambda value with the lowest MSE. It recommended that eight predictors be used to predict income, confirming that Black and Construction be removed from the equation. The lowest lambda from cross-validation was found to be 25.8, and the MSE for the best Lasso model was 11672982. The R^2 value was found to be 0.889, meaning that 88.9% of the variation in income can be explained by the model.
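The exact-zero behaviour that removes predictors comes from the lasso's soft-thresholding update. A minimal coordinate-descent sketch on synthetic stand-in data (the report itself used a packaged fit; the lambda and data here are illustrative):

```r
# Soft-thresholding operator: shrinks toward zero, clipping small values to 0.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for the lasso on standardized predictors.
lasso_cd <- function(X, y, lambda, iters = 200) {
  Xs <- scale(X)
  yc <- y - mean(y)
  n <- nrow(Xs)
  beta <- rep(0, ncol(Xs))
  for (it in seq_len(iters)) {
    for (j in seq_along(beta)) {
      r_j <- yc - Xs[, -j, drop = FALSE] %*% beta[-j]  # partial residual
      beta[j] <- soft_threshold(crossprod(Xs[, j], r_j) / n, lambda)
    }
  }
  beta  # coefficients on the standardized scale
}

set.seed(9)
X <- matrix(rnorm(500), ncol = 5)
y <- X %*% c(3, -2, 0, 0, 0) + rnorm(100)
round(lasso_cd(X, y, lambda = 0.5), 3)  # weak predictors are driven to zero
```

The strong predictors keep (shrunken) nonzero coefficients, while the irrelevant ones are clipped to exactly 0, mirroring how Black and Construction dropped out above.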
Call:
lm(formula = IncomePerCap ~ ., data = datJLClean)
Residuals:
Min 1Q Median 3Q Max
-27748.0 -1894.5 -20.2 1770.9 18988.4
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17391.74 36.71 473.707 < 2e-16 ***
Hispanic 104.54 54.22 1.928 0.0539 .
White 669.66 72.93 9.183 < 2e-16 ***
Black 334.20 52.11 6.414 1.43e-10 ***
Asian 326.28 25.82 12.635 < 2e-16 ***
Professional 1077.33 2706.77 0.398 0.6906
Service -814.52 1609.32 -0.506 0.6128
Office -358.87 1173.52 -0.306 0.7598
Construction -378.99 1193.57 -0.318 0.7508
Production -515.94 1505.74 -0.343 0.7319
Unemployment -616.00 17.07 -36.087 < 2e-16 ***
ipc2High 19892.01 62.25 319.559 < 2e-16 ***
ipc4Mid-Low 5181.26 42.82 120.998 < 2e-16 ***
ipc4(2.47e+04,3.23e+04] -9860.07 41.82 -235.757 < 2e-16 ***
ipc4(3.23e+04,5.6e+04] NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3412 on 69553 degrees of freedom
Multiple R-squared: 0.8893, Adjusted R-squared: 0.8892
F-statistic: 4.296e+04 on 13 and 69553 DF, p-value: < 2.2e-16
Call:
lm(formula = IncomePerCap ~ Hispanic + White + Black + Asian +
Professional + Service + Office + Production + Unemployment,
data = datJLClean)
Residuals:
Min 1Q Median 3Q Max
-57889 -3155 -139 3092 39315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 20.93 1250.46 <2e-16 ***
Hispanic 972.98 87.56 11.11 <2e-16 ***
White 2983.19 117.35 25.42 <2e-16 ***
Black 1175.89 84.20 13.97 <2e-16 ***
Asian 1115.17 41.58 26.82 <2e-16 ***
Professional 5991.86 53.93 111.11 <2e-16 ***
Service -737.90 39.77 -18.55 <2e-16 ***
Office 359.14 28.63 12.54 <2e-16 ***
Production -667.56 40.62 -16.44 <2e-16 ***
Unemployment -1604.14 26.65 -60.19 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69557 degrees of freedom
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.894e+04 on 9 and 69557 DF, p-value: < 2.2e-16
MSE for full model :
[1] 11638291
MSE for full model (w/o construction) :
[1] 30460435
An OLS model was constructed for both the full model and the model without the Construction variable, in order to compare them to the Ridge and Lasso models. The full model, which also includes the binned income factors ipc2 and ipc4, had an R^2 of 0.889, while the model without Construction had an R^2 of 0.710; that is, they explain about 88.9% and 71.0% of the variation in income, respectively. The MSE for the full model was found to be 11638291, and the model without Construction had a much larger MSE at 30460435. Overall the Lasso, Ridge, and full OLS models explain roughly the same amount of variability in the data, with R^2 values around 0.88 to 0.89. Since the full OLS model has the lowest MSE and the highest R^2, it would be a more suitable option than the Ridge, Lasso, or reduced OLS model.
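The MSE comparison above can be reproduced for any pair of nested lm() fits; a small sketch on stand-in data (names and values illustrative):

```r
# In-sample MSE for nested OLS fits: dropping a predictor can only raise
# (or leave unchanged) the training MSE of the smaller model.
set.seed(3)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 2 * df$x1 + rnorm(100)

full    <- lm(y ~ x1 + x2, data = df)
reduced <- lm(y ~ x1, data = df)

mse <- function(fit) mean(residuals(fit)^2)
c(full = mse(full), reduced = mse(reduced))
```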